---
title: Troubleshooting
description: Troubleshooting guide for common issues when running self-hosted Opik deployments.
---

This guide covers common troubleshooting scenarios for self-hosted Opik deployments.

## Common Issues

### ClickHouse Zookeeper Metadata Loss

#### Problem Description

If Zookeeper loses the metadata paths for ClickHouse tables, coordination exceptions appear in the ClickHouse logs and may also surface in the opik-backend service logs.

**Symptoms:**

Error messages like the following appear in the ClickHouse logs and may propagate to the opik-backend service logs:

```
Code: 999. Coordination::Exception: Coordination error: No node, path /clickhouse/tables/0/default/DATABASECHANGELOG/log. (KEEPER_EXCEPTION)
```

This indicates that Zookeeper has lost the metadata paths for one or more ClickHouse tables.
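When several tables are affected, it can help to list every path that ClickHouse reports as missing. The following is a minimal sketch that assumes the log lines follow the error format shown above; `extract_missing_paths` is a hypothetical helper name, not part of Opik or ClickHouse.

```bash
# Hypothetical helper: read log text on stdin and print each distinct
# Keeper path reported in a "No node, path ..." coordination error.
extract_missing_paths() {
  grep -o 'No node, path [^ ]*' \
    | sed -e 's/^No node, path //' -e 's/\.$//' \
    | sort -u
}

# Example usage (assumes the ClickHouse pod name used in this guide):
# kubectl logs chi-opik-clickhouse-cluster-0-0-0 | extract_missing_paths
```

The output is one path per line, which makes it easier to see whether a single table or the whole `/clickhouse/tables` tree is affected.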

#### Resolution Steps

Follow these steps to restore ClickHouse table metadata in Zookeeper:

##### 1. Clean Zookeeper Paths (If Needed)

If only some table paths are missing in Zookeeper, delete the paths that still exist before restoring the metadata. Connect to the Zookeeper pod and use the Zookeeper CLI:

```bash
# Connect to Zookeeper pod
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkCli.sh -server localhost:2181

# Delete all ClickHouse table paths
deleteall /clickhouse/tables
```

<Callout intent="warning">
  **Warning**: This operation removes all table metadata from Zookeeper. Proceed with caution.
</Callout>

##### 2. Restart ClickHouse

Restart the ClickHouse pods so they become aware that Zookeeper no longer has the metadata:

```bash
kubectl rollout restart statefulset/chi-opik-clickhouse-cluster-0-0
```

##### 3. Restore Replica Definitions

Connect to the first ClickHouse replica and restore the replica definitions for each table:

```bash
# Connect to the first ClickHouse replica
kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client
```

<Callout intent="warning">
  **Important**: The Opik database (schema) name is typically `opik` but may vary depending on your installation. Before proceeding, run `SHOW DATABASES;` in ClickHouse to identify the Opik database, and use that name in all subsequent commands.
</Callout>

Run the `SYSTEM RESTORE REPLICA` command for each table:

```sql
-- Restore system tables
SYSTEM RESTORE REPLICA default.DATABASECHANGELOG;
SYSTEM RESTORE REPLICA default.DATABASECHANGELOGLOCK;

-- Verify your Opik database name
SHOW DATABASES;

-- List all Opik tables (replace 'opik' with your actual schema name if different)
USE opik;
SHOW TABLES;

-- Restore each Opik table
SYSTEM RESTORE REPLICA opik.attachments;
SYSTEM RESTORE REPLICA opik.automation_rule_evaluator_logs;
SYSTEM RESTORE REPLICA opik.comments;
SYSTEM RESTORE REPLICA opik.dataset_items;
SYSTEM RESTORE REPLICA opik.experiment_items;
SYSTEM RESTORE REPLICA opik.experiments;
SYSTEM RESTORE REPLICA opik.feedback_scores;
SYSTEM RESTORE REPLICA opik.guardrails;
SYSTEM RESTORE REPLICA opik.optimizations;
SYSTEM RESTORE REPLICA opik.project_configurations;
SYSTEM RESTORE REPLICA opik.spans;
SYSTEM RESTORE REPLICA opik.traces;
SYSTEM RESTORE REPLICA opik.trace_threads;
SYSTEM RESTORE REPLICA opik.workspace_configurations;
```

<Callout intent="info">
  **Note**: The exact list of tables may vary depending on your Opik version. Run `SHOW TABLES;` in your Opik database to list all tables, and restore each one.
</Callout>
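Because the table list varies between versions, the statements can be generated rather than typed by hand. The following is a minimal sketch, not an official Opik tool: `build_restore_statements` is a hypothetical helper, and the usage comment assumes the pod name and the `opik` database name used in this guide.

```bash
# Hypothetical helper: read table names on stdin (one per line) and print a
# SYSTEM RESTORE REPLICA statement for each, qualified with the database name.
build_restore_statements() {
  db="$1"
  while IFS= read -r table; do
    # Skip blank lines so empty input produces no statements.
    [ -n "$table" ] || continue
    printf 'SYSTEM RESTORE REPLICA %s.%s;\n' "$db" "$table"
  done
}

# Example usage (assumes the pod and database names used in this guide):
# kubectl exec chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client --query \
#   "SELECT name FROM system.tables WHERE database = 'opik' AND engine LIKE 'Replicated%'" \
#   | build_restore_statements opik
```

Review the generated statements before running them in `clickhouse-client`, since the helper only formats text and does not check whether a table actually needs restoring.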

##### 4. Restart ClickHouse Again

Restart ClickHouse again to ensure it:
- Re-establishes connections to Zookeeper
- Verifies and synchronizes the newly restored metadata
- Automatically resumes normal replication operations

```bash
kubectl rollout restart statefulset/chi-opik-clickhouse-cluster-0-0
```

##### 5. Validate the Recovery

After the restart completes, verify that the replica status is healthy:

```sql
-- Check table creation
SHOW CREATE TABLE opik.attachments;

-- Verify replica status
SELECT table, is_readonly, replica_is_active, zookeeper_exception
FROM system.replicas;
```

**Expected Results:**
- `is_readonly = 0` (table is writable)
- `replica_is_active` shows `1` for every replica (in recent ClickHouse versions this column is a map of replica name to status)
- `zookeeper_exception = ''` (no exceptions)
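With many tables, scanning the result by eye is error-prone. The following is a minimal sketch that filters tab-separated query output down to the tables that fail the `is_readonly` and `zookeeper_exception` checks; `find_unhealthy_replicas` is a hypothetical helper, and the column order in the usage comment is an assumption of this example.

```bash
# Hypothetical helper: read tab-separated rows of
#   database, table, is_readonly, zookeeper_exception
# on stdin and print "database.table" for every row that is read-only or
# carries a non-empty exception message.
find_unhealthy_replicas() {
  awk -F'\t' '$3 != 0 || $4 != "" { print $1 "." $2 }'
}

# Example usage (assumes the pod name used in this guide):
# kubectl exec chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client --query \
#   "SELECT database, table, is_readonly, zookeeper_exception FROM system.replicas FORMAT TabSeparated" \
#   | find_unhealthy_replicas
```

Empty output means every table passed both checks. The same filter could feed an alert for the monitoring practice described later in this guide.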

You can also verify from the Zookeeper side:

```bash
# Connect to Zookeeper CLI
kubectl exec -it cometml-production-opik-zookeeper-0 -- zkCli.sh -server localhost:2181

# Inspect a table's metadata path (substitute your database and table names)
ls /clickhouse/tables/0/<database_name>/<table_name>
```

## Diagnostic Commands

### Connecting to ClickHouse

Connect directly to ClickHouse pods for diagnostics:

```bash
# Connect to first replica
kubectl exec -it chi-opik-clickhouse-cluster-0-0-0 -- clickhouse-client

# Connect to second replica (if running multiple replicas)
kubectl exec -it chi-opik-clickhouse-cluster-0-1-0 -- clickhouse-client
```

### Connecting to Zookeeper

Connect directly to Zookeeper pods:

```bash
# Connect to Zookeeper pod
kubectl exec -it cometml-production-opik-zookeeper-0 -- bash

# Run Zookeeper client commands
zkCli.sh -server localhost:2181
```

Common Zookeeper commands:

```bash
# List tables in Zookeeper
kubectl exec -it cometml-production-opik-zookeeper-0 -- \
  zkCli.sh -server localhost:2181 ls /clickhouse/tables/0/opik

# Remove a specific table from Zookeeper
kubectl exec -it cometml-production-opik-zookeeper-0 -- \
  zkCli.sh -server localhost:2181 \
  deleteall /clickhouse/tables/0/opik/optimizations
```
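Since `deleteall` removes everything under the given path, a guard that rejects anything broader than a single table node can reduce the risk of a typo. The following is a minimal sketch: `is_single_table_path` is a hypothetical helper, and it assumes numeric shard identifiers and database and table names made of letters, digits, and underscores.

```bash
# Hypothetical guard: succeed only if the argument looks like a single table
# node, i.e. /clickhouse/tables/<shard>/<database>/<table>.
is_single_table_path() {
  printf '%s\n' "$1" \
    | grep -Eq '^/clickhouse/tables/[0-9]+/[A-Za-z0-9_]+/[A-Za-z0-9_]+$'
}

# Example usage (assumes the Zookeeper pod name used in this guide):
# path=/clickhouse/tables/0/opik/optimizations
# is_single_table_path "$path" && \
#   kubectl exec -it cometml-production-opik-zookeeper-0 -- \
#     zkCli.sh -server localhost:2181 deleteall "$path"
```

The guard deliberately rejects `/clickhouse/tables` and database-level paths, so wider deletions still have to be run by hand.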

## Prevention and Best Practices

To avoid Zookeeper metadata loss issues:

1. **Regular Backups**: Implement regular backups of ClickHouse data. See the [Advanced ClickHouse Backup](/docs/opik/self-host/backup) guide for details.

2. **Monitoring**: Set up monitoring for Zookeeper health and ClickHouse replica status. Alert on `zookeeper_exception` in `system.replicas`.

3. **Resource Allocation**: Ensure Zookeeper has adequate resources (CPU, memory, disk) to maintain metadata reliably.

4. **Persistent Storage**: Use persistent volumes for Zookeeper to prevent data loss during pod restarts.

5. **Replica Validation**: Regularly check replica status with the diagnostic queries above.

## Getting Help

If you continue to experience issues after following this guide:

1. Check the [Opik GitHub Issues](https://github.com/comet-ml/opik/issues) for similar problems
2. Review ClickHouse and Zookeeper logs for additional error details
3. Open a new issue on GitHub with:
   - **Opik versions**:
     - Backend version (opik-backend)
     - Frontend version (opik-frontend)
     - Helm chart version (if deployed via Helm)
   - ClickHouse version
   - Zookeeper version
   - Error logs from all services (ClickHouse, Zookeeper, opik-backend)
   - Steps taken to reproduce the issue

